AITopics | bandit policy

bandit policy

Differentiable Meta-Learning of Bandit Policies

Neural Information Processing Systems | Dec-23-2025, 19:22:27 GMT

Exploration policies in Bayesian bandits maximize the average reward over problem instances drawn from some distribution P. In this work, we learn such policies for an unknown distribution P using samples from P. Our approach is a form of meta-learning and exploits properties of P without making strong assumptions about its form. To do this, we parameterize our policies in a differentiable way and optimize them by policy gradients, an approach that is pleasantly general and easy to implement. We derive effective gradient estimators and propose novel variance reduction techniques. We also analyze and experiment with various bandit policy classes, including neural networks and a novel softmax policy. The latter has regret guarantees and is a natural starting point for our optimization. Our experiments show the versatility of our approach. We also observe that neural network policies can learn implicit biases expressed only through the sampled instances.
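The idea of optimizing a differentiable bandit policy by policy gradients can be illustrated with a minimal sketch. This is not the authors' implementation: the temperature-style parameter `theta`, the softmax over empirical means, the uniform prior over Bernoulli arm means, the default estimate of 0.5 for unpulled arms, and the leave-one-out baseline are all assumptions chosen for brevity.

```python
import math
import random

def softmax_probs(theta, estimates):
    """Arm probabilities proportional to exp(theta * estimated mean)."""
    logits = [theta * e for e in estimates]
    top = max(logits)  # subtract the max so exp() cannot overflow
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def run_episode(theta, arm_means, horizon, rng):
    """Play one Bernoulli bandit; return (total reward, summed score terms)."""
    k = len(arm_means)
    pulls, sums = [0] * k, [0.0] * k
    total_reward, score = 0.0, 0.0
    for _ in range(horizon):
        # Assumed default estimate of 0.5 for arms that were never pulled.
        est = [sums[i] / pulls[i] if pulls[i] else 0.5 for i in range(k)]
        probs = softmax_probs(theta, est)
        arm = rng.choices(range(k), weights=probs)[0]
        # d/dtheta log pi(arm | history) for the softmax above.
        score += est[arm] - sum(p * e for p, e in zip(probs, est))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
        total_reward += reward
    return total_reward, score

def meta_gradient(theta, batch, horizon, rng):
    """Score-function gradient with a leave-one-out baseline (needs >= 2 runs)."""
    runs = [run_episode(theta, inst, horizon, rng) for inst in batch]
    reward_sum = sum(r for r, _ in runs)
    m = len(runs)
    grad = 0.0
    for r, s in runs:
        baseline = (reward_sum - r) / (m - 1)  # mean reward of the other runs
        grad += (r - baseline) * s
    return grad / m

def meta_learn(steps=50, batch_size=20, horizon=30, lr=0.01, seed=0):
    """Gradient ascent on theta over instances sampled from an assumed P."""
    rng = random.Random(seed)
    theta = 1.0
    for _ in range(steps):
        # Assumed P: two arms with means drawn uniformly from [0, 1].
        batch = [[rng.random(), rng.random()] for _ in range(batch_size)]
        theta += lr * meta_gradient(theta, batch, horizon, rng)
    return theta
```

Calling `meta_learn()` returns the tuned `theta`. The baseline for each run is computed from the other runs only, so it does not bias the gradient estimate.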

bandit policy, differentiable meta-learning, name change, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.51)

The goal of our work was easy reproducibility and clearly showing the benefits of learning to explore over the state

Neural Information Processing Systems | Oct-9-2025, 13:22:34 GMT

Thank you for the reviews of our paper. We will revise the paper accordingly. We discuss a contextual extension in Section 8. The policies for longer horizons also perform well and outperform TS. This can be seen in the proof in Appendix C, which only requires that γ = 1/θ ∈ [1/8, 1).

artificial intelligence, bandit policy, gradband, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning (0.30)

Learning to Optimize Feedback for One Million Students: Insights from Multi-Armed and Contextual Bandits in Large-Scale Online Tutoring

Schmucker, Robin, Pachapurkar, Nimish, Bala, Shanmuga, Shah, Miral, Mitchell, Tom

arXiv.org Artificial Intelligence | Aug-4-2025

We present an online tutoring system that learns to provide effective feedback to students after they answer questions incorrectly. Using data from one million students, the system learns which assistance action (e.g., one of multiple hints) to provide for each question to optimize student learning. Employing the multi-armed bandit (MAB) framework and offline policy evaluation, we assess 43,000 assistance actions, and identify trade-offs between assistance policies optimized for different student outcomes (e.g., response correctness, session completion). We design an algorithm that for each question decides on a suitable policy training objective to enhance students' immediate second attempt success and overall practice session performance. We evaluate the resulting MAB policies in 166,000 practice sessions, verifying significant improvements in student outcomes. While MAB policies optimize feedback for the overall student population, we further investigate whether contextual bandit (CB) policies can enhance outcomes by personalizing feedback based on individual student features (e.g., ability estimates, response times). Using causal inference, we examine (i) how effects of assistance actions vary across students and (ii) whether CB policies, which leverage such effect heterogeneity, outperform MAB policies. While our analysis reveals that some actions for some questions exhibit effect heterogeneity, effect sizes may often be too small for CB policies to provide significant improvements beyond what well-optimized MAB policies that deliver the same action to all students already achieve. We discuss insights gained from deploying data-driven systems at scale and implications for future refinements. Today, the teaching policies optimized by our system support thousands of students daily.
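Offline policy evaluation of assistance actions can be sketched with inverse propensity scoring (IPS). The abstract does not say which estimator the authors use, so IPS is an assumption here, as are the function name `ips_value` and the log format of `(action, reward, logging_probability)` tuples.

```python
def ips_value(logs, target_policy):
    """Inverse-propensity estimate of a target policy's mean reward.

    logs: list of (action, reward, logging_probability) tuples, where
        logging_probability is the chance the logging policy chose that action.
    target_policy: dict mapping action -> probability under the policy
        being evaluated (missing actions count as probability 0).
    """
    if not logs:
        raise ValueError("need at least one logged interaction")
    total = 0.0
    for action, reward, logging_prob in logs:
        # Reweight each logged reward by how much more (or less) often
        # the target policy would have taken the logged action.
        weight = target_policy.get(action, 0.0) / logging_prob
        total += weight * reward
    return total / len(logs)
```

For example, with logs collected by showing hypothetical hints "A" and "B" uniformly at random, `ips_value(logs, {"A": 1.0})` estimates the outcome of always showing hint "A" without deploying that policy.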

assistance action, machine learning, reinforcement learning, (20 more...)

arXiv.org Artificial Intelligence

2508.0027

Country:

North America > United States (1.00)
Europe > United Kingdom > England (0.45)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Instructional Material > Course Syllabus & Notes (1.00)
Research Report > Strength High (0.93)

Industry: Education > Educational Technology > Educational Software > Computer Based Training (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
(2 more...)

Quick-Draw Bandits: Quickly Optimizing in Nonstationary Environments with Extremely Many Arms

Everett, Derek, Lu, Fred, Raff, Edward, Camacho, Fernando, Holt, James

arXiv.org Machine Learning | Jun-2-2025

Canonical algorithms for multi-armed bandits typically assume a stationary reward environment where the size of the action space (number of arms) is small. More recently developed methods typically relax only one of these assumptions: existing non-stationary bandit policies are designed for a small number of arms, while Lipschitz, linear, and Gaussian process bandit policies are designed to handle a large (or infinite) number of arms in stationary reward environments under constraints on the reward function. In this manuscript, we propose a novel policy to learn reward environments over a continuous space using Gaussian interpolation. We show that our method efficiently learns continuous Lipschitz reward functions with $\mathcal{O}^*(\sqrt{T})$ cumulative regret. Furthermore, our method naturally extends to non-stationary problems with a simple modification. We finally demonstrate that our method is computationally favorable (100-10000x faster) and experimentally outperforms sliding Gaussian process policies on datasets with non-stationarity and an extremely large number of arms.
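One way to picture reward estimation over a continuous arm space is a Gaussian-kernel weighted average of past observations, with an optional time discount for non-stationary rewards. This is a generic kernel-smoothing sketch, not the paper's algorithm: the function name `gaussian_estimate` and the parameters `bandwidth` and `decay` are assumptions.

```python
import math

def gaussian_estimate(x, history, bandwidth=0.1, now=None, decay=0.0):
    """Kernel-weighted mean reward at arm location x.

    history: list of (arm_location, reward, time) observations.
    now, decay: if `now` is given, an observation made at time t is
        down-weighted by exp(-decay * (now - t)).
    """
    num = 0.0
    den = 0.0
    for loc, reward, t in history:
        # Gaussian weight: nearby arms inform the estimate, distant ones do not.
        w = math.exp(-0.5 * ((x - loc) / bandwidth) ** 2)
        if now is not None:
            w *= math.exp(-decay * (now - t))  # forget old observations
        num += w * reward
        den += w
    # With no informative observations, fall back to 0.0 (an arbitrary default).
    return num / den if den > 0 else 0.0
```

Each query costs time linear in the number of observations, which is one reason a sketch like this is cheaper than refitting a Gaussian process.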

bandit policy, data mining, machine learning, (20 more...)

arXiv.org Machine Learning

doi: 10.1145/3711896.3737097

2505.24692

Country:

North America > Canada > Ontario > Toronto (0.05)
North America > United States > Virginia > Fairfax County > McLean (0.04)
North America > United States > Maryland > Baltimore (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.68)

Review for NeurIPS paper: Differentiable Meta-Learning of Bandit Policies

Neural Information Processing Systems | Jan-22-2025, 01:19:22 GMT

As in standard policy-gradient methods, it seems that two key parameters are the batch size m and the horizon n. It would be good to provide some sensitivity analysis on these parameters to better assess how the approach scales to complex problems. In particular, what is the effect of the horizon on the gradient estimation? Does the variance blow up or is the baseline sufficient to keep it under control? In this sense, it might be good to have differentiable strategies that are provably efficient (e.g., with sub-linear regret) for a range of parameter values, so that whatever value of $\theta$ we encounter during optimization does not perform poorly.
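For context, the generic score-function gradient estimator with a baseline makes the roles of the batch size $m$ and the horizon $n$ explicit. The symbols $I_{j,t}$ (arm pulled in round $t$ of run $j$), $H_{j,t}$ (history before that round), $Y_{j,s}$ (reward in round $s$) and $b_{j,t}$ (baseline) are notation assumed here and may differ from the paper's:

$$
\hat{g}(\theta) = \frac{1}{m} \sum_{j=1}^{m} \sum_{t=1}^{n} \nabla_\theta \log \pi_\theta(I_{j,t} \mid H_{j,t}) \left( \sum_{s=t}^{n} Y_{j,s} - b_{j,t} \right)
$$

The estimator stays unbiased for any baseline $b_{j,t}$ that is fixed before $I_{j,t}$ is drawn, because $\mathbb{E}[\nabla_\theta \log \pi_\theta(I_{j,t} \mid H_{j,t}) \mid H_{j,t}] = 0$. Each run contributes $n$ score terms, so the variance can grow with the horizon unless the baseline removes most of the reward signal that is unrelated to the chosen action.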

bandit policy, differentiable meta-learning, neurips paper, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Review for NeurIPS paper: Differentiable Meta-Learning of Bandit Policies

Neural Information Processing Systems | Jan-22-2025, 01:19:15 GMT

The rebuttal helped clarify the questions raised in the review. The consensus reached in the discussion is that this is a borderline-plus paper. The reviewers appreciate the contribution's practicality, relevance and usefulness, but they remain concerned about the narrow scope and would rather have seen the policy-gradient method applied to parameterized policies for more complex learning problems. On the whole, this is a worthwhile addition to the program. The rebuttal did not answer one question successfully, namely regarding the setup in the experiments section, where the learning process operating at two levels remained confusing.

bandit policy, differentiable meta-learning, neurips paper, (1 more...)

Neural Information Processing Systems

Industry: Education > Focused Education > Special Education (0.31)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Differentiable Meta-Learning of Bandit Policies

Neural Information Processing Systems | Oct-9-2024, 14:53:45 GMT

bandit policy, differentiable meta-learning

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.36)

HyperBandit: Contextual Bandit with Hypernetwork for Time-Varying User Preferences in Streaming Recommendation

Shen, Chenglei, Zhang, Xiao, Wei, Wei, Xu, Jun

arXiv.org Artificial Intelligence | Aug-14-2023

In real-world streaming recommender systems, user preferences often dynamically change over time (e.g., a user may have different preferences during weekdays and weekends). Existing bandit-based streaming recommendation models only consider time as a timestamp, without explicitly modeling the relationship between time variables and time-varying user preferences. This leads to recommendation models that cannot quickly adapt to dynamic scenarios. To address this issue, we propose a contextual bandit approach using hypernetwork, called HyperBandit, which takes time features as input and dynamically adjusts the recommendation model for time-varying user preferences. Specifically, HyperBandit maintains a neural network capable of generating the parameters for estimating time-varying rewards, taking into account the correlation between time features and user preferences. Using the estimated time-varying rewards, a bandit policy is employed to make online recommendations by learning the latent item contexts. To meet the real-time requirements in streaming recommendation scenarios, we have verified the existence of a low-rank structure in the parameter matrix and utilize low-rank factorization for efficient training. Theoretically, we demonstrate a sublinear regret upper bound against the best policy. Extensive experiments on real-world datasets show that the proposed HyperBandit consistently outperforms the state-of-the-art baselines in terms of accumulated rewards.
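The combination of a hypernetwork and low-rank factorization can be sketched in a few lines. This sketch assumes a purely linear hypernetwork, whereas the abstract describes a neural network, and the names `A`, `B`, `hyper_params` and `estimated_reward` are hypothetical. Time features $z_t$ are mapped to reward parameters $\theta(t) = A(Bz_t)$, where `A` is d×r and `B` is r×p, so the implied d×p matrix has rank at most r.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def hyper_params(A, B, time_features):
    """theta(t) = A @ (B @ z_t): a rank-r map from time features to parameters."""
    return matvec(A, matvec(B, time_features))

def estimated_reward(A, B, time_features, item_context):
    """Linear reward estimate theta(t) . x for one item context x."""
    theta = hyper_params(A, B, time_features)
    return sum(t * x for t, x in zip(theta, item_context))

# Hypothetical example with d=2 context dims, rank r=1, p=2 time features.
A = [[1.0], [2.0]]
B = [[1.0, 0.0]]
weekday, weekend = [1.0, 0.0], [0.0, 1.0]
weekday_theta = hyper_params(A, B, weekday)
weekend_theta = hyper_params(A, B, weekend)
```

The factorized form stores d·r + r·p numbers instead of d·p, and the two-step product never forms the full matrix. That is the kind of saving low-rank structure offers when r is small.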

artificial intelligence, hyperbandit, recommendation, (13 more...)

arXiv.org Artificial Intelligence

2308.08497

Country:

Asia > China > Beijing > Beijing (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > New York > New York County > New York City (0.04)
(5 more...)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)

Differentiable Meta-Learning in Contextual Bandits

Kveton, Branislav, Mladenov, Martin, Hsu, Chih-Wei, Zaheer, Manzil, Szepesvari, Csaba, Boutilier, Craig

arXiv.org Machine Learning | Jun-9-2020

We study a contextual bandit setting where the learning agent has access to sampled bandit instances from an unknown prior distribution $\mathcal{P}$. The goal of the agent is to achieve high reward on average over the instances drawn from $\mathcal{P}$. This setting is of particular importance because it formalizes the offline optimization of bandit policies so that they perform well on average over anticipated bandit instances. The main idea in our work is to optimize differentiable bandit policies by policy gradients. We derive reward gradients that reflect the structure of our problem, and propose contextual policies that are parameterized in a differentiable way and have low regret. Our algorithmic and theoretical contributions are supported by extensive experiments that show the importance of baseline subtraction, learned biases, and the practicality of our approach on a range of classification tasks.

cosoftelim, data mining, machine learning, (19 more...)

arXiv.org Machine Learning

2006.05094

Country:

North America > Canada > Alberta (0.14)
South America > Paraguay > Asunción > Asunción (0.04)
North America > United States > New York (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry:

Health & Medicine (0.46)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.67)
Information Technology > Data Science > Data Mining > Big Data (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.46)

Differentiable Bandit Exploration

Boutilier, Craig, Hsu, Chih-Wei, Kveton, Branislav, Mladenov, Martin, Szepesvari, Csaba, Zaheer, Manzil

arXiv.org Machine Learning | Feb-17-2020

We learn bandit policies that maximize the average reward over bandit instances drawn from an unknown distribution $\mathcal{P}$, from a sample from $\mathcal{P}$. Our approach is an instance of meta-learning and its appeal is that the properties of $\mathcal{P}$ can be exploited without restricting it. We parameterize our policies in a differentiable way and optimize them by policy gradients, an approach that is easy to implement and pleasantly general. The challenge is then to design effective gradient estimators and good policy classes. To make policy gradients practical, we introduce novel variance reduction techniques. We experiment with various bandit policy classes, including neural networks and a novel soft-elimination policy. The latter has regret guarantees and is a natural starting point for our optimization. Our experiments highlight the versatility of our approach. We also observe that neural network policies can learn implicit biases, which are only expressed through sampled bandit instances during training.
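A soft-elimination rule can be sketched as follows. The abstract does not give the policy's formula, so the score $2(\hat{\mu}_{\max} - \hat{\mu}_i)^2 T_i$ and the scaling by $\theta^2$ used here are assumptions for illustration, not the paper's definition. The intent is that an arm's probability shrinks smoothly as evidence against it accumulates, and that the probabilities stay differentiable in `theta`.

```python
import math

def soft_elim_probs(means, pulls, theta):
    """Arm probabilities that softly eliminate arms with poor estimates.

    means: empirical mean reward of each arm.
    pulls: number of times each arm has been pulled.
    theta: positive exploration parameter; larger values explore more.
    """
    best = max(means)
    # Evidence against each arm: squared gap to the best arm, scaled by pulls.
    scores = [2.0 * (best - mu) ** 2 * n for mu, n in zip(means, pulls)]
    # Scores are non-negative, so every weight lies in (0, 1] and the
    # arm with the best estimate always has weight 1.
    weights = [math.exp(-s / theta ** 2) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]
```

With equal estimates the rule is uniform. With a large gap and many pulls, nearly all probability mass moves to the leading arm unless `theta` is large.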

bandit, gradient, softelim, (17 more...)

arXiv.org Machine Learning

2002.06772

Country:

North America > Canada > Alberta (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
